INTRODUCTION

In an experiment which is designed to compare specific treatments,
varieties, or methods, one may wish to decide which population means are
actually equal, and which are unequal. In the various fields of agriculture,
industry, and physical or biological science, one often is confronted with
such problems; therefore it becomes important to have test procedures which
have certain desirable properties. It often is also desirable to know how
convenient it is to use the suggested test procedures.

In the early 1920's the British statistician R. A. Fisher introduced a
statistical technique for analyzing experiments called the analysis of variance.
In general, the analysis of variance compares possible sources of variation
in the experimental units with an appropriate measure of random sampling
error, and leads to decisions to accept or to reject appropriate statistical
hypotheses regarding population parameters. The F-test commonly is used to
make these decisions. In other words, this is a statistical technique for
analyzing measurements depending on several kinds of effects operating
simultaneously, to decide which kinds of effects are important, and to estimate
these effects.

Before an analysis of variance can be conducted, it is essential to
know the properties of the observed values, i.e., the nature and distributions
of the observed values. If r observations $x_1, x_2, \ldots, x_r$, assumed to be
on r random variables, are linear combinations of m unknown effects
$\alpha_1, \alpha_2, \ldots, \alpha_m$ plus errors
$\epsilon_1, \epsilon_2, \ldots, \epsilon_r$, one usually can express
x in the form of a mathematical model such as:

$$x_i = \alpha_1 + \alpha_2 + \cdots + \alpha_m + \epsilon_i \qquad (i = 1, \ldots, r)$$


The $\alpha$'s are more or less idealized formulations of some aspects of the
observations which underlie the phenomena of interest to the investigators.
If the $\{\alpha_j\}$ measure the unknown effects in some describable way they can
be defined as parameters. A model is called a fixed-effects model if all
these $\alpha_j$ are unknown constants. A model in which all $\alpha_j$ are random
variables, except one which is used to represent the "general mean," is
called a random-effects model. One other model, called a mixed-effects
model, is a combination of the fixed and random models in which at
least one effect is random and at least one is a fixed effect. 'Fixed effects'
means that the treatments involved in the experiment are the only treatments
whose effects interest the investigators. 'Random
effects' means the treatments are chosen randomly or systematically from a
large group of treatments and the investigator is interested not only in
those particular treatments but also in the whole group of treatments.

A multiple comparison test is applied to a fixed-effects model because it
concerns only testing differences among the various treatment means involved
in the experiment, and this implies fixed effects with their corresponding means.

A one-way classification for a fixed-effects model refers to the effects
of inequalities among the true means of several (univariate) treatments,
$\mu_1, \mu_2, \ldots, \mu_k$. It is assumed that the k treatments have a common
variance, $\sigma^2$, and that independent random samples of equal size r
were taken from the k populations. This can be expressed in the mathematical
model:

$$x_{ij} = \mu + \tau_i + \epsilon_{ij} \qquad (i = 1, \ldots, k;\ j = 1, \ldots, r)$$

where

$\epsilon_{ij}$ are NID$(0, \sigma^2)$,

$\mu$ = population mean,

$\tau_i = (\mu_i - \mu)$ = fixed effect for the ith treatment.

The null hypothesis which the F-test rejects or accepts at a stated level
of significance ($= \alpha$) is

$$H_0 : \mu_1 = \mu_2 = \cdots = \mu_k$$

or equivalently

$$H_0 : \tau_i = 0 \quad \text{for all } i = 1, \ldots, k$$

If the F-test rejects the null hypothesis, there remains the problem of
deciding which treatment means are different from each other.

One problem in statistical testing is the determination of the probability
of rejecting the null hypothesis when it is true. This probability
is called the probability of a Type I error. The other sort of error possible
is the acceptance of the null hypothesis when it is false. One must also
determine the probability of this error, which is called the probability of
a Type II error. These probabilities can be determined easily when specific
alternative hypotheses are involved. These probabilities become difficult
to compute, however, when one is interested in equality or inequality of
several treatments, as considered for example in tests of all two-mean comparisons.


This report deals with various procedures designed to make realistic
comparisons among treatment means. These procedures have been based on
somewhat different approaches because several logical points of view are
possible regarding the relative importance of the two kinds of errors. What
balance should one strike between the probabilities of Type I and Type II
errors? Some multiple comparison procedures described in the literature are
very cautious about Type I errors. Others try to sacrifice some of this
caution in favor of lower probabilities of Type II errors. Other procedures,
however, are intermediate testing procedures because they tend to balance
excessive Type I and Type II errors.

The  purpose  of  this  report  is  to  discuss  various  properties  of  each 
of  several  procedures  separately,  illustrating  their  application  with  an 
example,  and  discussing  some  of  their  differences.  The  Monte  Carlo  technique 
was  used  to  compute  power,  the  probability  of  not  committing  a  Type  II  error, 
and  protection,  the  probability  of  not  committing  a  Type  I  error,  for  three 
test procedures: Fisher's LSD, the multiple-t test, and Duncan's new multiple
range  test.  These  are  probably  the  most  widely  used  multiple  comparison 
procedures  in  experimental  statistics. 

GENERAL  DISCUSSION 

Before the properties and application of various tests are considered,
it is helpful to discuss some of the general problems and some of the concepts
related to Type I and Type II errors which are involved in multiple comparison
test procedures.


The  Multiple  Decision  Problem 

For some experiments, it is necessary not only to determine equalities
among means but also, if inequalities exist, to determine the magnitude of
the inequalities. For a two-mean example the following decisions besides
equality are possible:

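    $\mu_1 < \mu_2$
    $\mu_2 < \mu_1$

For three means (n = 3)

    $\mu_1 < \mu_2 < \mu_3$
    $\mu_1 < \mu_3 < \mu_2$
    $\mu_2 < \mu_1 < \mu_3$
    $\mu_2 < \mu_3 < \mu_1$
    $\mu_3 < \mu_1 < \mu_2$
    $\mu_3 < \mu_2 < \mu_1$

    $\mu_1 < \mu_2 = \mu_3$
    $\mu_2 < \mu_1 = \mu_3$
    $\mu_3 < \mu_1 = \mu_2$
    $\mu_1 = \mu_2 < \mu_3$
    $\mu_1 = \mu_3 < \mu_2$
    $\mu_2 = \mu_3 < \mu_1$

    $\mu_1 < \mu_2$, but $\mu_3$ cannot be ranked relative to $\mu_1$ or $\mu_2$
    $\mu_1 < \mu_3$, but $\mu_2$ cannot be ranked relative to $\mu_1$ or $\mu_3$
    $\mu_2 < \mu_3$, but $\mu_1$ cannot be ranked relative to $\mu_2$ or $\mu_3$
    $\mu_2 < \mu_1$, but $\mu_3$ cannot be ranked relative to $\mu_1$ or $\mu_2$
    $\mu_3 < \mu_1$, but $\mu_2$ cannot be ranked relative to $\mu_1$ or $\mu_3$
    $\mu_3 < \mu_2$, but $\mu_1$ cannot be ranked relative to $\mu_2$ or $\mu_3$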

A total of 19 possible decisions, including $\mu_1 = \mu_2 = \mu_3$, can be made.
These 19 possible decisions are well explained and illustrated by Duncan
(1) with his geometric method. In the case of three-mean comparisons he
is able to represent all decisions in a two-dimensional sample space.
Duncan shows properties of symmetry in comparisons among three means. As
the number of treatment means increases, the number of decisions increases
very rapidly, complicating the decision processes considerably. In the
general case with n means there are $n!$ decisions of the form
$\mu_1 < \mu_2 < \cdots < \mu_n$; $(n-1)\,n!/2!$ decisions of the form
$\mu_1 = \mu_2 < \mu_3 < \cdots < \mu_n$, with one pair of means equal;
$(n-2)\,n!/3!$ decisions of the form $\mu_1 = \mu_2 = \mu_3 < \mu_4 < \cdots < \mu_n$,
with three equal means; $(n-2)\,n!$ decisions of the form
$\mu_1 = \mu_2$ and $\mu_2 = \mu_3$ but $\mu_1 < \mu_3 < \mu_4 < \cdots < \mu_n$,
with two overlapping pairs of equal means; etc.
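
As a quick check on these counts, the short sketch below evaluates the four expressions quoted above for a given n. The function name `decision_counts` and the dictionary keys are illustrative labels only, not terminology from Duncan (1). For n = 3 the four forms exhaust all possibilities, so the counts should add to the 19 decisions listed; for larger n further forms exist (the "etc." above), so the sum is then only a partial count.

```python
from math import factorial


def decision_counts(n: int) -> dict:
    """Evaluate the four decision counts quoted in the text for n means."""
    f = factorial(n)
    return {
        "strict ordering": f,                      # n!
        "one pair equal": (n - 1) * f // 2,        # (n-1) n! / 2!
        "three equal means": (n - 2) * f // 6,     # (n-2) n! / 3!
        "two overlapping pairs": (n - 2) * f,      # (n-2) n!
    }


counts = decision_counts(3)
total_three_means = sum(counts.values())
print(counts)
print("total for three means:", total_three_means)
```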

In making comparisons among a set of means, two possibilities must be kept
in mind: 1) comparisons can be made on a per-treatment basis, or 2) on a
group basis, i.e., some treatment means are grouped together. Hence it is
very important that the procedural method be determined prior to running an
experiment. Attempting to group treatments after an experiment has been
performed might distort the probability statements one wishes to make from
the experiment.

Protection  Levels  Against  Type  I  Error 

Several multiple comparison tests take different points of view toward
committing Type I and Type II errors, and this again essentially is the reason
for differences among the tests which have been proposed. The probability
of a Type I error is called $\alpha$, i.e., for a two-mean comparison

$$\alpha = \Pr[\,\text{decision } (\mu_1 \neq \mu_2) \mid \mu_1 = \mu_2\,]$$

The $\alpha$ also is called a significance level when two or more means are compared.
The protection level is then defined as the probability of not committing a
Type I error. In the Student-Newman-Keuls test the protection level is kept at
the same level, namely $(1 - \alpha)$, for the two-mean, three-mean, ..., n-mean
comparisons. On the other hand, Duncan, Tukey, and Scheffé, among others,
believe that the multiple-t test has too low a protection to be a satisfactory
multiple comparison test. Duncan (1), taking an intermediate position,
suggested the use of special protection levels. The special protection levels
involve degrees of freedom and are given as

$$\gamma_p = \gamma_2^{\,p-1}$$

where

$\gamma_2 = (1 - \alpha)$,

$\alpha$ = significance level,

p = number of means in the subset.

It is called "a p-mean protection level and is the minimum probability of
finding no wrong significant differences among p observed means." Duncan
said, "first it should be noted that if a symmetric test with optimum power
functions were constructed subject only to a restriction on the value $\gamma_2$,
the higher order protection levels would almost invariably be too low to be
satisfactory. .... The four-mean protection level of this multiple normal-
deviate test, as it may be termed, will be seen later to be only $\gamma_4 = 79.7\%$.
That is, the minimum probability of finding no wrong significant differences
between the four means is only 79.7%. This is too low to be satisfactory.
The three-mean protection levels in the same test have the value $\gamma_3 = 87.8\%$,
which is also too low. On the other hand, it does not necessarily follow
that all of the high order protection levels should be raised to the value
$\gamma_2$ of the two-mean protection level as some writers have implicitly assumed.
Any increases in the latter levels must necessarily be made at the expense
of losses in power and it is most important that the levels be raised no
more than is absolutely necessary. ...." This reasoning is developed with
an experimental example (1).


Some of these numerical levels are shown below:

              Multiple-t test (4)     Duncan's NMRT

p = 3               87.8%                90.25%

p = 4               79.7%                85.7%
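
The levels in this comparison can be reproduced approximately with a few lines of code. The sketch below is illustrative only and is not the Monte Carlo program used for this report: `duncan_protection` evaluates $(1-\alpha)^{p-1}$ directly, while `normal_deviate_protection` is a small simulation that assumes a known unit standard error for each mean and the two-sided 5% normal deviate 1.96. The function names, replication count, and seed are all assumptions made for the example.

```python
import random


def duncan_protection(alpha: float, p: int) -> float:
    """Duncan's special p-mean protection level, (1 - alpha)^(p - 1)."""
    return (1.0 - alpha) ** (p - 1)


def normal_deviate_protection(p: int, crit: float = 1.96,
                              reps: int = 20000, seed: int = 1) -> float:
    """Estimate the chance of no wrong significant difference among p equal means.

    Each sample mean is drawn as N(0, 1), so a difference of two means has
    standard deviation sqrt(2); a pair is declared different when the
    difference exceeds crit * sqrt(2). No pair is declared different exactly
    when the range of the p means stays within that bound.
    """
    rng = random.Random(seed)
    bound = crit * 2 ** 0.5
    hits = 0
    for _ in range(reps):
        xs = [rng.gauss(0.0, 1.0) for _ in range(p)]
        if max(xs) - min(xs) <= bound:
            hits += 1
    return hits / reps


est3 = normal_deviate_protection(3)
est4 = normal_deviate_protection(4)
for p, est in ((3, est3), (4, est4)):
    print(f"p = {p}: multiple normal-deviate ~ {est:.3f}, "
          f"Duncan = {duncan_protection(0.05, p):.4f}")
```

The simulated values should fall near the 87.8% and 79.7% quoted above, subject to sampling error of a few tenths of a percent.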

Power  of  a  Test 

The power of a test is the probability of rejecting $H_0$ when $H_0$ is
false, i.e.,

$$\text{Power} = 1 - \beta$$

where $\beta$ = probability of a Type II error.

In studying the power of multiple comparison tests one is confronted
with the difficulty pointed out by Duncan that none of the comparisons,
even when two means are involved, is a two-decision procedure. The power
function applied in this problem is Neyman and Pearson's (12), which however
is defined as a power function strictly based on a two-decision concept.

When two means are compared, three decisions are possible: $\mu_1 = \mu_2$,
$\mu_1 < \mu_2$, and $\mu_2 < \mu_1$. In order to avoid making separate decisions for
the two inequalities, one tests the null hypothesis against the alternative
$\mu_1 \neq \mu_2$. The test, for a given $\alpha_0$ level, is two-sided to account for the
two inequality statements. The power is the probability of the decision
$\mu_1 \neq \mu_2$ expressed as a function of the true difference $d = \mu_1 - \mu_2$.
A power function for this case is illustrated by the dotted line in Fig. 1.

Duncan (1) criticized this procedure. He reasoned that, "by pooling
the probability of the two decisions ($\mu_1 < \mu_2$) and ($\mu_2 < \mu_1$) for any given
value of the true difference, it combines the probability of the correct
decision (that $\mu_1$ or $\mu_2$ is the higher mean as the truth may be), with the
probability of the most incorrect decision (that $\mu_1 > \mu_2$ when in fact
$\mu_2 > \mu_1$, or $\mu_2 > \mu_1$ when in fact $\mu_1 > \mu_2$). A function which combines
probabilities of serious errors in this way, is of no value in measuring
desirable or undesirable properties."

To overcome this difficulty, Duncan (1) suggested that a three-decision
test concerning two means can be changed to a joint application of two
two-decision tests, which would test the hypothesis $\mu_1 \le \mu_2$
against the alternative $\mu_2 < \mu_1$, and $\mu_2 \le \mu_1$ against the alternative
$\mu_1 < \mu_2$. Therefore, in a two-mean comparison two Neyman-Pearson power
functions are required for obtaining the power of a test. The power curves
by this concept are illustrated by the sigmoid and reverse-sigmoid curves
respectively in Fig. 1, where $\alpha$ for each one is $\tfrac{1}{2}\alpha_0$.


Fig. 1. Power Function for $\alpha_0$-Level Symmetric Test
From his reasoning for the three-decision case, Duncan generalized
to the case of n means. The number of power functions for n means is
n(n-1)/2. However, Duncan also commented that instead of using all the
power functions involved this number can be reduced because of symmetrical
properties established in making comparisons of the means. If this condition
exists, only one of the n(n-1)/2 power functions need be investigated in
order to investigate them all.

An increase in $\alpha$ will generally result in an increase in the power
and consequently a decrease in $\beta$ when the sample size is fixed. The
selection of a particular test procedure depends upon the nature of the
consequences of the decisions, i.e., whether one would rather tolerate a
relatively high Type I error or a relatively high Type II error.

FISHER'S  LSD  TEST 

In 1935 Fisher (5) proposed a very simple test procedure, later called
the Least Significant Difference (or LSD) test, which has been widely used
in experimental statistics. This test procedure is just a repeated use of
Student's t-test for two-mean comparisons after $H_0$ has been rejected. A
quantity

$$\text{LSD} = s_{\bar{x}} \sqrt{2}\; t_{\alpha, f}$$

is computed, where

$s_{\bar{x}} = \sqrt{\text{Error Mean Square}/n}$ = the standard error of a mean,

n = number of observations for the mean,

$t_{\alpha, f}$ = the $\alpha$-level significance value of t with f
degrees of freedom,

f = the number of degrees of freedom for the Error Mean
Square in the analysis of variance.
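
The LSD formula translates directly into code. The following is a minimal sketch; the function name `lsd` and its argument names are illustrative. The tabulated t value must be supplied by the user, and the numbers in the usage line (error mean square 124.29, three observations per mean, t = 2.064 for 24 degrees of freedom) are those of the cabbage example worked in this section.

```python
from math import sqrt


def lsd(error_mean_square: float, n: int, t_value: float) -> float:
    """Least significant difference: s_xbar * sqrt(2) * t."""
    s_mean = sqrt(error_mean_square / n)  # standard error of a mean
    return s_mean * sqrt(2.0) * t_value


lsd_value = lsd(error_mean_square=124.29, n=3, t_value=2.064)
print(round(lsd_value, 1))
```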

Fisher suggested that this LSD be compared with all differences between
pairs of sample means. If the difference between any pair of means is
greater than the value of LSD, it can be said that $\mu_i$ and $\mu_j$ are
different means. The number of such comparisons among n treatment means is

$${}_nC_2 = \frac{n(n-1)}{2}.$$

The following example is based on Keuls' (8) cabbage experiment.
A suitable area was divided into 39 plots, grouped into 3 blocks of 13 plots
each. In each block the 13 varieties to be investigated were planted in a
randomized block design. Keuls stated, "The purpose was to learn which
variety would give the highest gross yield per head of cabbage and which the
lowest, in other words to find approximately the order of the varieties
according to gross yield per cabbage." However, this report will use Keuls'
example to illustrate the application of all the test procedures to be
discussed. Table 1 shows the coded data and the means for each variety.
Table 1. Keuls' (8) Cabbage Experiment. Coded Gross Weights

Variety    Block A    Block B    Block C    Average    Rank

   1          89         46         33       176.0       1
   2           0         -5        -21       111.3      11
   3         -25        -16        -26        97.7      13
   4          22          7         -3       128.7       8
   5           8          6        -12       120.7      10
   6          31         10         -5       132.0       5
   7          54         15         -4       141.7       4
   8          -8        -20        -30       100.7      12
   9          31          7         -5       131.0       6
  10          26         14        -27       124.3       9
  11          67         29          2       152.7       2
  12          65         21          6       150.7       3
  13          23          7         -3       129.0       7


In  this  example  the  null  hypothesis  would  be  that  there  are  no  variety 
effects  on  the  size  of  the  cabbage  heads.  Table  2  shows  the  results  of  the 
analysis  of  variance  applied  to  Keuls'  data. 

Table 2. Analysis of Variance for Table 1. $\alpha = .05$

Source of                        Mean Square
Variation     D/F      Sample      Expected                          F

Varieties     12       1392.8      $\sigma^2 + 3\sum\tau_i^2/12$     11.21*

Blocks         2       4407.5      $\sigma^2 + 13\sigma_b^2$

Error         24        124.29     $\sigma^2$
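
The sample mean squares for varieties and error can be checked from the coded data of Table 1. The sketch below is an illustrative randomized-block computation, not the procedure used by Keuls (8); the function name `rcb_anova` and the result keys are assumptions made for the example, and the data are transcribed from Table 1 (one row per variety, columns for blocks A, B, C).

```python
# Coded gross weights as read from Table 1: rows = varieties 1..13, columns = blocks A, B, C.
data = [
    [89, 46, 33], [0, -5, -21], [-25, -16, -26], [22, 7, -3], [8, 6, -12],
    [31, 10, -5], [54, 15, -4], [-8, -20, -30], [31, 7, -5], [26, 14, -27],
    [67, 29, 2], [65, 21, 6], [23, 7, -3],
]


def rcb_anova(rows):
    """Mean squares and treatment F-ratio for a randomized complete block design."""
    t, b = len(rows), len(rows[0])                     # treatments, blocks
    grand = sum(sum(r) for r in rows)
    cf = grand ** 2 / (t * b)                          # correction factor
    ss_total = sum(x * x for r in rows for x in r) - cf
    ss_trt = sum(sum(r) ** 2 for r in rows) / b - cf
    ss_blk = sum(sum(r[j] for r in rows) ** 2 for j in range(b)) / t - cf
    ss_err = ss_total - ss_trt - ss_blk
    ms_trt = ss_trt / (t - 1)
    ms_blk = ss_blk / (b - 1)
    ms_err = ss_err / ((t - 1) * (b - 1))
    return {
        "ms_varieties": ms_trt,
        "ms_blocks": ms_blk,
        "ms_error": ms_err,
        "df_error": (t - 1) * (b - 1),
        "f_varieties": ms_trt / ms_err,
    }


anova = rcb_anova(data)
for key, value in anova.items():
    print(f"{key}: {value:.2f}")
```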


The observed F-ratio for varieties is found to be significant so the null
hypothesis is rejected. Keuls concluded that it was improbable that the
observed variety means form a random sample from one and the same normal
population. The next problem is to find all significant differences among
variety means by the LSD procedure:

$$\text{LSD} = 2.064 \sqrt{124.29\,(2/3)} = 18.8$$

where 2.064 is the .05-level value of t with 24 degrees of freedom.


Table 3 is constructed as an ordered array by listing all the variety
means in rank order. All the differences between means have been computed.
The differences which are less than LSD = 18.8 are marked with an asterisk.

Table 3. Fisher's LSD Analysis of Keuls' Cabbage Data. $\alpha = .05$

Var       3      8      2      5     10      4     13      9      6      7     12     11      1
Mean   97.7  100.7  111.3  120.7  124.3  128.7  129.0  131.0  132.0  141.7  150.7  152.7  176.0

3              3.0*  13.6*  23.0   26.6   31.0   31.3   33.3   34.3   44.0   53.0   55.0   78.3
8                    10.6*  20.0   23.6   28.0   28.3   30.3   31.3   41.0   50.0   52.0   75.3
2                            9.4*  13.0*  17.4*  17.7*  19.7   20.7   30.4   39.4   41.4   64.7
5                                   3.6*   8.0*   8.3*  10.3*  11.3*  21.0   30.0   32.0   55.3
10                                         4.4*   4.7*   6.7*   7.7*  17.4*  26.4   28.4   51.7
4                                                 0.3*   2.3*   3.3*  13.0*  22.0   24.0   47.3
13                                                       2.0*   3.0*  12.7*  21.7   23.7   47.0
9                                                               1.0*  10.7*  19.7   21.7   45.0
6                                                                      9.7*  18.7*  20.7   44.0
7                                                                             9.0*  11.0*  34.3
12                                                                                   2.0*  25.3
11                                                                                         23.3


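The hypotheses $\mu_i = \mu_j$ which are accepted at $\alpha = .05$ are therefore

    $\mu_2 = \mu_3$,  $\mu_2 = \mu_4$,  $\mu_2 = \mu_5$,  $\mu_2 = \mu_8$,  $\mu_2 = \mu_{10}$,  $\mu_2 = \mu_{13}$
    $\mu_3 = \mu_8$
    $\mu_4 = \mu_5$,  $\mu_4 = \mu_6$,  $\mu_4 = \mu_7$,  $\mu_4 = \mu_9$,  $\mu_4 = \mu_{10}$,  $\mu_4 = \mu_{13}$
    $\mu_5 = \mu_6$,  $\mu_5 = \mu_9$,  $\mu_5 = \mu_{10}$,  $\mu_5 = \mu_{13}$
    $\mu_6 = \mu_7$,  $\mu_6 = \mu_9$,  $\mu_6 = \mu_{10}$,  $\mu_6 = \mu_{12}$,  $\mu_6 = \mu_{13}$
    $\mu_7 = \mu_9$,  $\mu_7 = \mu_{10}$,  $\mu_7 = \mu_{11}$,  $\mu_7 = \mu_{12}$,  $\mu_7 = \mu_{13}$
    $\mu_9 = \mu_{10}$,  $\mu_9 = \mu_{13}$
    $\mu_{10} = \mu_{13}$
    $\mu_{11} = \mu_{12}$

and all other remaining hypotheses $\mu_i = \mu_j$ are rejected.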
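
The pairwise LSD decisions can be tabulated mechanically from the variety means. The sketch below is only an illustration of that bookkeeping, using the rounded means of Table 3 and LSD = 18.8; the variable names are assumptions made for the example.

```python
from itertools import combinations

# Variety means from Table 3, keyed by variety number.
means = {3: 97.7, 8: 100.7, 2: 111.3, 5: 120.7, 10: 124.3, 4: 128.7, 13: 129.0,
         9: 131.0, 6: 132.0, 7: 141.7, 12: 150.7, 11: 152.7, 1: 176.0}
LSD = 18.8

# A hypothesis mu_i = mu_j is accepted when the two sample means differ by less than LSD.
accepted = [(i, j) for i, j in combinations(sorted(means), 2)
            if abs(means[i] - means[j]) < LSD]
n_pairs = len(means) * (len(means) - 1) // 2   # n(n-1)/2 comparisons in all

print(f"{len(accepted)} of {n_pairs} hypotheses accepted")
for i, j in accepted:
    print(f"mu_{i} = mu_{j}")
```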

MULTIPLE-t  TEST 

A test procedure very similar to the LSD test is the multiple-t. It
does not require an F-test before applying the LSD procedure to all hypotheses
of the form $\mu_i = \mu_j$, for all i, j. The multiple-t test will, in general,
have lower probability of Type II error than the LSD test because even though
the F-test leads to acceptance of the null hypothesis, the multiple-t test
may detect differences among means.

This test could be applied to Keuls' cabbage example from the preceding
section; but because F for varieties was significant, the results of this
test are exactly the same as those of the LSD test.


LIMITATIONS OF LSD AND MULTIPLE-t TESTS

The LSD and multiple-t test procedures only test hypotheses $\mu_i = \mu_j$,
i and j, 1 to n. However, in some experimental situations it is useful to
make comparisons, not only of two-mean groups but also three-mean groups,
..., and n-mean groups. Other forms of decisions also may be desired,
such as certain linear combinations of treatment means. In addition the
LSD and multiple-t tests make the probability of committing Type I errors
somewhat greater than the specified significance level. Therefore several
investigators (1, 14, 17) have proposed other test procedures. These are
usually of two kinds: multiple range tests and multiple-F tests.

MULTIPLE  RANGE  TESTS 

The  Student-Newman-Keuls  Test 

This multiple range test differs from the multiple-t test in that the
protection level for a group of n means is fixed at $(1 - \alpha)$ for all p-mean
comparisons, p = 2, ..., n. Student (16) first suggested using the quantity
q = w/s to determine differences among treatment means, where w is
the range in a sample of n observations from a normal population with
standard deviation $\sigma$, and s is an independent estimate of $\sigma$. Later
Newman (11) modified Student's idea, presenting a table which was computed
by quadrature from Pearson's (12) approximate probability law of the
studentized range. Keuls (8) developed these ideas further. The Student-
Newman-Keuls test is called a multiple range test because the over-all
procedure involves the repeated use of range tests on the p-mean groups, p = 2,
..., n. This method is summarized by the following rule: The difference
between any two means in a set of n means is significant provided the
range of each and every subset which contains the given two means is
significant in an $\alpha$-level range test. Federer (4) lists the steps for
following this rule:

"Step (I)   Subdivide the treatment means into biological, physical,
            or sociological groups. Natural groupings as prescribed
            by the choice of the particular set of treatments have
            meaning; it is doubtful if a ranked set of means from
            two or more natural groups has any practical significance.
Step (II)   Choose a significance level, $\alpha$, which usually will be
            the 5 or 1 per cent level.
Step (III)  Compute the standard error of a treatment mean, $s_{\bar{x}}$, and
            the values $W_n = q_{\alpha,n}\, s_{\bar{x}}$, $W_{n-1} = q_{\alpha,n-1}\, s_{\bar{x}}$, ...,
            $W_3 = q_{\alpha,3}\, s_{\bar{x}}$, and $W_2 = q_{\alpha,2}\, s_{\bar{x}} = t_{\alpha,f} \sqrt{2}\, s_{\bar{x}}$ = LSD.
            Rank the treatment means from highest to lowest, $\bar{x}_n$,
            $\bar{x}_{n-1}$, ..., $\bar{x}_2$, $\bar{x}_1$.
Step (IV)   Compare the range of n treatments, $\bar{x}_n - \bar{x}_1$, with the
            calculated $W_n$. If $\bar{x}_n - \bar{x}_1$ is less than $W_n$, the process
            stops, and the n means are asserted to belong to a
            non-heterogeneous group. If $\bar{x}_n - \bar{x}_1 > W_n$ subdivide the
            means into two groups of n-1 means each, $\bar{x}_n$ to $\bar{x}_2$ and
            $\bar{x}_{n-1}$ to $\bar{x}_1$, and state that $\bar{x}_n$ is different from $\bar{x}_1$. Then,
            compare the ranges $\bar{x}_n - \bar{x}_2$ and $\bar{x}_{n-1} - \bar{x}_1$ with $W_{n-1}$.
            If either range is less than $W_{n-1}$ the means in the group
            are said to belong to a single group. If either range
            exceeds $W_{n-1}$ the n-1 means are divided into two groups
            of n-2 means each and compared with $W_{n-2}$. The process
            continues until a subset of means is obtained which does
            not exceed the calculated value $W_i$. The process stops
            whenever the actual range of the subset is less than the
            calculated range. No subset of means is compared if the
            subset is included in a larger subset which is less than
            the calculated range $W_i$,"
where

$$q = \frac{\bar{x}_{\max} - \bar{x}_{\min}}{s_{\bar{x}}} = \frac{\text{range}}{\text{standard deviation}}$$

As an example, consider Keuls' cabbage data. For the thirteen variety
means, twelve $W_p$ values are needed, namely,

$s_{\bar{x}} = \sqrt{124.29/3} = \sqrt{41.43} = 6.437$ with 24 degrees of freedom

$\alpha = .05$

    $W_{13}$ = 5.18 (6.437) = 33.3
    $W_{12}$ = 5.10 (6.437) = 32.8
    $W_{11}$ = 5.01 (6.437) = 32.2
    $W_{10}$ = 4.92 (6.437) = 31.7
    $W_9$  = 4.81 (6.437) = 31.0
    $W_8$  = 4.68 (6.437) = 30.1
    $W_7$  = 4.54 (6.437) = 29.2
    $W_6$  = 4.37 (6.437) = 28.1
    $W_5$  = 4.17 (6.437) = 26.8
    $W_4$  = 3.90 (6.437) = 25.1
    $W_3$  = 3.53 (6.437) = 22.7 and
    $W_2$  = 2.92 (6.437) = 18.8 = LSD.
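
The twelve critical ranges are each a tabulated studentized-range value times the standard error of a mean, so they can be generated in one pass. In the sketch below the dictionary `Q_05_24` simply restates the .05-level values for 24 degrees of freedom quoted above; the variable names are illustrative.

```python
from math import sqrt

# .05-level studentized-range values for 24 degrees of freedom, as quoted in the text.
Q_05_24 = {2: 2.92, 3: 3.53, 4: 3.90, 5: 4.17, 6: 4.37, 7: 4.54, 8: 4.68,
           9: 4.81, 10: 4.92, 11: 5.01, 12: 5.10, 13: 5.18}

s_mean = round(sqrt(124.29 / 3), 3)            # standard error of a variety mean
W = {p: round(q * s_mean, 1) for p, q in Q_05_24.items()}

for p in sorted(W, reverse=True):
    print(f"W_{p} = {Q_05_24[p]:.2f} ({s_mean}) = {W[p]}")
```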


The thirteen variety means, in order of size, are:

Var.     3      8      2      5     10      4     13      9      6      7     12     11      1
Mean  97.7  100.7  111.3  120.7  124.3  128.7  129.0  131.0  132.0  141.7  150.7  152.7  176.0

The first step requires comparing the $W_p$ with the range between the
largest and smallest mean until the observed value does not exceed the
calculated value:

    $\bar{x}_1 - \bar{x}_3$ = 176.0 - 97.7 = 78.3 > $W_{13}$ = 33.3

    $\bar{x}_1 - \bar{x}_8$ = 176.0 - 100.7 = 75.3 > $W_{12}$ = 32.8
    $\bar{x}_{11} - \bar{x}_3$ = 152.7 - 97.7 = 55.0 > $W_{12}$ = 32.8
    $\bar{x}_1 - \bar{x}_2$ = 176.0 - 111.3 = 64.7 > $W_{11}$ = 32.2
    $\bar{x}_{12} - \bar{x}_3$ = 150.7 - 97.7 = 53.0 > $W_{11}$ = 32.2
    $\bar{x}_{11} - \bar{x}_8$ = 152.7 - 100.7 = 52.0 > $W_{11}$ = 32.2
    . . .
    $\bar{x}_{12} - \bar{x}_2$ = 150.7 - 111.3 = 39.4 > $W_9$ = 31.0
    $\bar{x}_{12} - \bar{x}_5$ = 150.7 - 120.7 = 30.0 < $W_8$ = 30.1

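The comparisons continue in this way. The result of comparisons of ranges
between variety means leads to acceptance of the following hypotheses of
the form $\mu_i = \mu_j$:

    $\mu_2 = \mu_3$,  $\mu_2 = \mu_4$,  $\mu_2 = \mu_5$,  $\mu_2 = \mu_6$,  $\mu_2 = \mu_8$,  $\mu_2 = \mu_9$,  $\mu_2 = \mu_{10}$,  $\mu_2 = \mu_{13}$
    $\mu_3 = \mu_5$,  $\mu_3 = \mu_8$,  $\mu_3 = \mu_{10}$
    $\mu_4 = \mu_5$,  $\mu_4 = \mu_6$,  $\mu_4 = \mu_7$,  $\mu_4 = \mu_9$,  $\mu_4 = \mu_{10}$,  $\mu_4 = \mu_{11}$,  $\mu_4 = \mu_{12}$,  $\mu_4 = \mu_{13}$
    $\mu_5 = \mu_6$,  $\mu_5 = \mu_7$,  $\mu_5 = \mu_8$,  $\mu_5 = \mu_9$,  $\mu_5 = \mu_{10}$,  $\mu_5 = \mu_{12}$,  $\mu_5 = \mu_{13}$
    $\mu_6 = \mu_7$,  $\mu_6 = \mu_9$,  $\mu_6 = \mu_{10}$,  $\mu_6 = \mu_{11}$,  $\mu_6 = \mu_{12}$,  $\mu_6 = \mu_{13}$
    $\mu_7 = \mu_9$,  $\mu_7 = \mu_{10}$,  $\mu_7 = \mu_{11}$,  $\mu_7 = \mu_{12}$,  $\mu_7 = \mu_{13}$
    $\mu_8 = \mu_{10}$
    $\mu_9 = \mu_{10}$,  $\mu_9 = \mu_{11}$,  $\mu_9 = \mu_{12}$,  $\mu_9 = \mu_{13}$
    $\mu_{10} = \mu_{11}$,  $\mu_{10} = \mu_{12}$,  $\mu_{10} = \mu_{13}$
    $\mu_{11} = \mu_{12}$,  $\mu_{11} = \mu_{13}$
    $\mu_{12} = \mu_{13}$

and all other remaining hypotheses $\mu_i = \mu_j$ are rejected.
From this example it can be noticed that this test accepts equalities of
variety means for wider ranges of sample means than do the LSD and multiple-t
tests.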
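
The rule behind the Student-Newman-Keuls decisions (two means differ only if every subset of the ordered array containing both has a significant range) can be sketched as follows. This is an illustrative implementation of the rule as stated in this section, not Federer's (4) or Keuls' (8) own computing scheme; the function name `accepted_pairs` and the data structures are assumptions, while the means and the $W_p$ values are those of the cabbage example.

```python
# Variety means in rank order (variety number, mean) and the W_p values from the text.
ordered = [(3, 97.7), (8, 100.7), (2, 111.3), (5, 120.7), (10, 124.3), (4, 128.7),
           (13, 129.0), (9, 131.0), (6, 132.0), (7, 141.7), (12, 150.7),
           (11, 152.7), (1, 176.0)]
W = {2: 18.8, 3: 22.7, 4: 25.1, 5: 26.8, 6: 28.1, 7: 29.2, 8: 30.1,
     9: 31.0, 10: 31.7, 11: 32.2, 12: 32.8, 13: 33.3}


def accepted_pairs(ordered_means, critical):
    """Pairs of varieties not declared different by a multiple range test.

    A stretch of p adjacent means in the ordered array whose range is below
    critical[p] is non-significant, and every pair inside it is accepted.
    """
    accepted = set()
    n = len(ordered_means)
    for i in range(n):
        for j in range(i + 1, n):
            p = j - i + 1
            if ordered_means[j][1] - ordered_means[i][1] < critical[p]:
                for a in range(i, j + 1):
                    for b in range(a + 1, j + 1):
                        accepted.add(frozenset((ordered_means[a][0],
                                                ordered_means[b][0])))
    return accepted


snk_accepted = accepted_pairs(ordered, W)
print(len(snk_accepted), "hypotheses accepted")
for pair in sorted(tuple(sorted(s)) for s in snk_accepted):
    print("mu_%d = mu_%d" % pair)
```

Supplying a different table of critical ranges to the same function gives the decisions of any other range test built on the same rule.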

Duncan's  New  Multiple  Range  Test 

Duncan (1) has proposed a test called the new multiple range test
(NMRT). He claims that this test is an optimum procedure. This procedure
is intended to be a compromise between the Student-Newman-Keuls test and the
multiple-t test. Duncan attempts to strike an optimum combination of
probabilities of Type I and II errors by introducing protection levels
which vary with degrees of freedom. The values of his so-called shortest
significant range, denoted as $R_p$, are smaller than the values used in any
other range test except the LSD or multiple-t, provided all the conditions
are the same. If the difference between any two means exceeds the
corresponding $R_p$, it is declared to be significant and $H_0(\mu_i = \mu_j)$ is
rejected, with one exception. The exception is that no difference between
two means can be declared significant if the means concerned are contained
in any subset of means in the ordered array which has a non-significant
range. For example, consider five treatment means, $\bar{x}_1, \bar{x}_2, \bar{x}_3, \bar{x}_4$, and $\bar{x}_5$,
and assume these means are in order of size. Then if it is found that
$\bar{x}_5 - \bar{x}_1 < R_5$, $H_0(\mu_1 = \mu_5)$ is accepted; and no other differences among
means between $\bar{x}_1$ and $\bar{x}_5$ can be considered significant, even though some
difference might exceed the appropriate $R_p$. In other words, it is not
possible to declare any two of these means different if $H_0(\mu_1 = \mu_5)$ has
been accepted and the sample means are in rank order from $\bar{x}_1$ to $\bar{x}_5$.

Duncan's shortest significant ranges are computed as follows:

$$R_p = s_{\bar{x}}\; q^*_{\alpha, p},$$

where

$s_{\bar{x}}$ = the standard error,

$q^*_{\alpha, p}$ = the tabular value (1) of Duncan's special significant
studentized ranges with the $\alpha$-level of significance and p the number
of means in the subset.

The shortest significant range for two means, $R_2 = s_{\bar{x}}\, q^*_{\alpha, 2}$, is the
same value as $W_2$ for the Student-Newman-Keuls test, which is the LSD.

In the case of n means, the desired number of shortest significant
ranges is (n-1). These $R_p$ values increase at a somewhat slower rate than
the corresponding $W$ values in the Student-Newman-Keuls' test. Duncan's
shortest significant ranges are computed so that the protection level for a
subset of p means is not fixed at $(1-\alpha)$, as for any subset of means in
Student-Newman-Keuls' test, but is $(1-\alpha)^{p-1}$, where p = 2, 3, ..., n.
It is Duncan's belief that this protection against Type I errors is adequate,
and his NMRT maintains better power against Type II errors than does the
Student-Newman-Keuls' test.
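
As a rough illustration of how these protection levels fall off with the size
of the subset, the following sketch evaluates $(1-\alpha)^{p-1}$ for
p = 2, ..., n. The function name and the choice of $\alpha$ = 0.05 with n = 13
(the number of variety means in the worked example) are assumptions made for
the illustration only.

```python
def duncan_protection_levels(alpha, n):
    """Return {p: (1 - alpha) ** (p - 1)} for subset sizes p = 2..n."""
    return {p: (1 - alpha) ** (p - 1) for p in range(2, n + 1)}


levels = duncan_protection_levels(alpha=0.05, n=13)
for p, level in levels.items():
    print(f"p = {p:2d}: protection level = {level:.3f}")
```

For two means the level is the usual 0.95; it decreases steadily as more means
are included in the subset.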

The data from Keuls' experiment are used to illustrate the calculations
for Duncan's NMRT.

The standard error for the mean is

$$s_{\bar{x}} = \sqrt{124.29/3} = 6.437$$

with twenty-four degrees of freedom. The shortest significant
ranges are computed as follows:

$R_{13}$ = 3.43 (6.437) = 22.1
$R_{12}$ = 3.41 (6.437) = 22.0
$R_{11}$ = 3.40 (6.437) = 21.9
$R_{10}$ = 3.38 (6.437) = 21.8
$R_9$ = 3.37 (6.437) = 21.7
$R_8$ = 3.34 (6.437) = 21.5
$R_7$ = 3.31 (6.437) = 21.3
$R_6$ = 3.28 (6.437) = 21.1
$R_5$ = 3.22 (6.437) = 20.7
$R_4$ = 3.15 (6.437) = 20.3
$R_3$ = 3.07 (6.437) = 19.8
$R_2$ = 2.92 (6.437) = 18.8 = LSD
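
The arithmetic above can be reproduced with a short script. This is only a
sketch: it assumes that 124.29 is the error mean square and 3 the number of
replications per variety (an interpretation of the standard-error formula),
and it takes the tabular values quoted in the text as given rather than
computing them. The names `Q_STAR` and `shortest_significant_ranges` are
hypothetical.

```python
import math

# Tabular values q*(0.05; p, 24) as quoted in the text for p = 2..13.
Q_STAR = {2: 2.92, 3: 3.07, 4: 3.15, 5: 3.22, 6: 3.28, 7: 3.31,
          8: 3.34, 9: 3.37, 10: 3.38, 11: 3.40, 12: 3.41, 13: 3.43}


def shortest_significant_ranges(error_mean_square, replications, q_table):
    """Return (s_xbar, {p: R_p}) with R_p = q*_p times s_xbar."""
    # The text rounds the standard error to three decimals before multiplying.
    s_xbar = round(math.sqrt(error_mean_square / replications), 3)
    return s_xbar, {p: round(q * s_xbar, 1) for p, q in q_table.items()}


s_xbar, ranges = shortest_significant_ranges(124.29, 3, Q_STAR)
for p in sorted(ranges, reverse=True):
    print(f"R_{p} = {Q_STAR[p]:.2f} ({s_xbar}) = {ranges[p]}")
```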

Using this test leads to acceptance of the following
hypotheses $\mu_i = \mu_j$ concerning variety means:


[The list of accepted hypotheses is illegible in the source scan.]

and all other remaining hypotheses $\mu_i = \mu_j$ are rejected.

Tukey's  Test  Based  on  Allowances 

Several multiple comparison procedures have been introduced by J. W.
Tukey (17). However, this report deals with only one of his procedures, based
on "allowances". When there are only two treatments, an "allowance" is
the same as the LSD = $t \, s_{\bar{x}} \sqrt{2}$. However, when there are more
than two treatments, the test based on allowances becomes a multiple range
test and an "allowance" is equal to the value of $W_n$ obtained in the
Student-Newman-Keuls' multiple range test. This value is called an hsd
(honestly significant difference), and if two means (or groups of means)
differ by more than hsd they are said to differ significantly. This procedure
may also be used for finding confidence intervals for the difference between
any two means.

In his paper Tukey (17) discussed the experimenter's desire to examine
all contrasts between treatment means, not only simple comparisons, i.e.,
differences between pairs of treatment means. His error rate is on a per
experiment basis rather than on a per decision basis for these comparisons.
He felt that it would be impractical to set an $\alpha$-level significance test
for each of the comparisons because the accumulated total errors over all
comparisons among the n treatments would be too high. This is the objection
to the LSD test most often found in the literature.

The value of hsd is 33.3 for testing differences between variety means
in Keuls' cabbage experiment. There are fewer rejections of the
hypothesis $\mu_i = \mu_j$ for Tukey's procedure than for any other test so far
discussed.
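
To see why a single large allowance rejects fewer hypotheses than the LSD, the
sketch below applies both critical differences (hsd = 33.3 and LSD = 18.8, as
quoted in the text) to a small set of means. The three means are hypothetical
and are not Keuls' data; the function name is likewise an assumption.

```python
from itertools import combinations


def pairs_exceeding(means, allowance):
    """Return the pairs of labels whose means differ by more than allowance."""
    return [(a, b)
            for (a, ma), (b, mb) in combinations(means.items(), 2)
            if abs(ma - mb) > allowance]


# Hypothetical variety means, for illustration only.
example_means = {"A": 100.0, "B": 120.0, "C": 140.0}

by_hsd = pairs_exceeding(example_means, allowance=33.3)
by_lsd = pairs_exceeding(example_means, allowance=18.8)
print("hsd rejects:", by_hsd)
print("LSD rejects:", by_lsd)
```

With these illustrative means, only the extreme pair differs by more than the
hsd, while all three pairs differ by more than the LSD.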

In 1953, Tukey (18) proposed another multiple range test procedure
with a less conservative attitude towards Type I error than in his previous
test. In this procedure the significant ranges are midway between the ones
required by the test based on allowances and those by the Student-Newman-
Keuls' test.

MULTIPLE-F TESTS

The multiple-F test consists of the combined use of range tests and
results of significant F tests. Duncan's (2) multiple comparison test and
Scheffe's (13) test are generally recognized as representative of this
procedure.

Duncan's Multiple Comparison Test

Duncan (2) proposed a multiple comparison test in 1951, which he
described as a "Multiple-F Test". It was a compromise between two rules.
These he defined as: "Rule 1 is the difference between any two means in
a set of n means is significant provided the variance of each and every
subset which contains the given means is significant according to an
$\alpha_p$-level F test where $\alpha_p = 1 - \gamma_p$, where
$\gamma_p = (1 - \alpha)^{p-1}$, and p is the number of means in the subset
concerned. Rule 2 is any comparison of the form $c = \sum k_i \bar{x}_i$ is
significantly different from zero provided the variance of each and every
subset which contains all of the means involved in c is significant according
to an $\alpha_p$-level F test; and provided, also, that c differs
significantly from zero according to an $\alpha$-level t-test, where
$c = \sum_{i=1}^{n} k_i \bar{x}_i$, and $k_1, k_2, \ldots, k_n$ is any set of
arbitrary constants such that $\sum_{i=1}^{n} k_i = 0$." Rule 1 is similar to
the method described for Fisher's (5) LSD test except that Duncan used an
$\alpha_p$-level instead of an $\alpha$-level. Duncan's compromise should be
interpreted so that as many significant differences as possible are found by
Rule 1.

Rule 2 is then used to test any comparisons within subsets of means
already found to contain significant differences by Rule 1.

Duncan's (2) multiple comparison test can be summarized in the following
four steps:

Step 1 List treatment means ranked in order, e.g.,

$$\bar{x}_1 < \bar{x}_2 < \ldots < \bar{x}_n .$$

Step 2 Determine significant ranges from

$$R'_p = s_{\bar{x}} \, q'_{\alpha;p,f} ,$$

where

$s_{\bar{x}}$ = estimated standard error of a mean
$q'_{\alpha;p,f}$ = the tabular value (4) with $\alpha$-level significance.
These values are different from $q^*_{\alpha;p,f}$ for the range test.
p = the number of means in a subset
f = degrees of freedom associated with $s_{\bar{x}}$.

Step 3 Determine a set of least significant sums of squares and of
the sums of squares among certain combinations of means

$$ss'_p = \tfrac{1}{2} \, {R'_p}^2 .$$

These values are compared with sums of squares among means.
For example, if there are three means, $\bar{x}_1$, $\bar{x}_2$, and
$\bar{x}_3$, compute the sum of squares

$$SS_{1,2,3} = \bar{x}_1^2 + \bar{x}_2^2 + \bar{x}_3^2
  - \frac{(\bar{x}_1 + \bar{x}_2 + \bar{x}_3)^2}{3}$$

and compare with $\tfrac{1}{2} \, {R'_3}^2 = ss'_3$.

Step 4 This step is used only in certain cases when the sample range
for all means in the group is less than $R'_p$, and the observed
sum of squares among means is larger than the computed sum of
squares for the number in the group. The n-1 degrees of freedom
are partitioned into single degree of freedom contrasts.
The comparison or comparisons contributing to the significant
sum of squares are segregated as in Step 3. This
is useful when testing the significance of a comparison
involving groups of means.
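
The comparison made in Step 3 can be sketched as follows. The code computes
the sum of squares among a subset of means and compares it with the least
significant sum of squares, taken here as one half of the squared significant
range. The subset of means and the value used for the significant range are
hypothetical numbers chosen for the illustration; in practice the range comes
from Step 2.

```python
def ss_among_means(means):
    """Sum of squares among p means: sum(x**2) - (sum(x))**2 / p."""
    p = len(means)
    return sum(m * m for m in means) - sum(means) ** 2 / p


def least_significant_ss(significant_range):
    """Least significant sum of squares: half the squared significant range."""
    return 0.5 * significant_range ** 2


# Hypothetical subset of three means and hypothetical significant range.
subset = [10.0, 12.0, 17.0]
r_prime_3 = 6.0

observed = ss_among_means(subset)
critical = least_significant_ss(r_prime_3)
print(observed, critical, observed > critical)
```

For a subset of two means the sum of squares is half the squared difference,
so comparing it with half the squared range is the same as comparing the
difference with the range itself, i.e. an ordinary range test.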
The first step is conducted ordinarily as a range test. In order to
compute the significant range for n means in Step 1, Duncan uses the relation

Duncan (3) recently published a new procedure. It is called the
Minimum Average-Weight-Risk Analysis, and is based on Bayes' theorem and
some recent ideas of Lehmann (9). This new procedure is similar in concept
to the multiple-t test. Duncan is still conducting research on this problem.

Scheffe's  Test 

In 1953, Henry Scheffe (13) introduced a test procedure based on linear
contrasts which include a wide variety of treatment comparisons. This test
procedure may be described as an F-test analogue of Tukey's (18) test based
on allowances. This multiple comparison test is defined by Scheffe as:
"A kind of simultaneous interval estimation and multiple significance test."
For testing the hypothesis

$$H_0: \mu_1 = \mu_2 = \ldots = \mu_n ,$$

a contrast among the parameters $\mu_1, \mu_2, \ldots, \mu_n$ is defined to be
a linear function of the $\mu$'s, $\sum_{i=1}^{n} c_i \mu_i$, with
$\sum_{i=1}^{n} c_i = 0$. This statement is similar to Rule 2 in Duncan's
multiple comparison test because both are linear combinations of the means.
Scheffe (14) stated the following theorem: "The probability is $(1-\alpha)$
that the values of all contrasts simultaneously satisfy the inequalities

$$\hat{\psi} - S \hat{\sigma}_{\hat{\psi}} \le \psi \le
  \hat{\psi} + S \hat{\sigma}_{\hat{\psi}} ,$$

where

$\psi = \sum_{i=1}^{n} c_i \mu_i$,
$\hat{\psi} = \sum_{i=1}^{n} c_i \bar{x}_i$ (unbiased estimate of $\psi$),
$\hat{\sigma}^2_{\hat{\psi}}$ = variance of $\hat{\psi}$,
$S = \sqrt{(n-1) F_{\alpha; n-1, f}}$,
n-1 = degrees of freedom for parameters, and
f = degrees of freedom for the error variance."

In the general case for any linear functions, i.e., no restriction on
$\sum c_i$, the same confidence interval is used as in the above case, and the
same probability statement is applicable, but the value of S becomes

$$S = \sqrt{q F_{\alpha; q, f}} ,$$

where q is generally the number of means.

Scheffe himself admitted this method is undesirable for contrasts
of the type $\mu_i - \mu_j$ because it gives rather wide intervals compared to
other methods. Therefore, this test is capable of accepting the null
hypothesis $H_0(\mu_i = \mu_j)$ too often.
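
A minimal sketch of a Scheffe-type simultaneous interval for one contrast is
given below. It assumes equal replication, so that the estimated variance of
the contrast is the error mean square times the sum of squared coefficients
divided by the number of replications; the critical F value is supplied from a
table rather than computed. All numbers in the usage example (means, error
mean square, replications) are hypothetical, and 3.35 is the tabled 5 per cent
point of F with 2 and 27 degrees of freedom.

```python
import math


def scheffe_interval(means, coeffs, error_ms, reps, f_crit):
    """Return (lower, upper) = psi_hat -/+ S * se for a single contrast."""
    if abs(sum(coeffs)) > 1e-12:
        raise ValueError("contrast coefficients must sum to zero")
    n = len(means)
    psi_hat = sum(c * m for c, m in zip(coeffs, means))
    # Estimated standard error of the contrast under equal replication.
    se = math.sqrt(error_ms * sum(c * c for c in coeffs) / reps)
    # S**2 = (n - 1) * F(alpha; n - 1, f)
    s = math.sqrt((n - 1) * f_crit)
    return psi_hat - s * se, psi_hat + s * se


# Hypothetical example: three means, ten observations each, error df = 27.
lower, upper = scheffe_interval(means=[10.0, 12.0, 15.0],
                                coeffs=[1, -1, 0],
                                error_ms=4.0, reps=10, f_crit=3.35)
print(f"{lower:.3f} <= psi <= {upper:.3f}")
```

Because the interval in this example covers zero, the simple contrast between
the first two means would not be declared significant, which illustrates the
width of the intervals for contrasts of this type.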

A careful study of Scheffe's test will show that it is constructed similarly
to Tukey's test with allowances, which also is based on confidence intervals.

GRAPHICAL COMPARISON OF TEST PROCEDURES

The purpose of showing Fig. 2 is to summarize schematically the
comparison of the results of various test procedures applied to Keuls'
cabbage experiment. Here only five procedures are compared, namely:
Tukey's test, Duncan's new multiple range test, Student-Newman-Keuls' test,
LSD test, and multiple-t test. Among these tests Tukey's test and the S-N-K
test are much alike in that they accept more equalities among variety means
than the three other tests. Tukey's test is the only one that declares
Variety 1 equal to Varieties 11 and 12. The LSD test and Duncan's NMRT
detect differences and accept equality among the means in reasonably the
same way except for a little disagreement in the middle of the range of
variety means.

EMPIRICAL  STUDY  OF  MULTIPLE  COMPARISON  TESTS 
Results  of  Monte  Carlo  Study 

The purpose of this part of the study is to attain a practical evaluation
of three of the multiple comparison test procedures described above.
Specifically, interest is focused on determining the power and the protection
level of three test procedures; namely, Fisher's LSD, multiple-t, and
Duncan's new multiple range test. In this study, the power and the protection
level were determined separately by using different combinations of means
with different variances.

In order to conduct this study the following three stages were required:

Stage 1 Random samples of size n = 10 were drawn from populations
with known means and variances. The population means ranged
from 5.0 to 13.0 and their variances were either 4 or 16.
Stage 2 Stage 1 was repeated 100 to 500 times depending on the number
of decisions desired for chosen sets of means and variance.
An analysis of variance, F-test, and Duncan's shortest
significant ranges for each set were obtained. The specific
combinations of means and variances are shown in Tables 4 and 5.
A high-speed computer was used to draw samples and to perform
computations for Stages 1 and 2.
Stage 3 All the differences between means were computed, and these
differences were compared to corresponding values of LSD and
Duncan's $R_p$'s. The results are illustrated by an example,
given in the analysis of data.
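
The three stages can be sketched in a few lines of code. This is not the
program used in the study: it relies on Python's standard pseudo-random
generator, covers only the unprotected pairwise t-test (the multiple-t test),
and takes the critical t value from a table. The function name, the default of
200 sets, and the seed are assumptions made for the illustration; 2.014 is the
tabled two-sided 5 per cent point of t for 45 degrees of freedom, which
applies to five samples of size 10.

```python
import random
from itertools import combinations
from statistics import mean, variance


def simulate_multiple_t(pop_means, sigma2, n=10, sets=200, t_crit=2.014, seed=1):
    """Estimate (protection level, power) of unprotected pairwise t-tests."""
    rng = random.Random(seed)
    sd = sigma2 ** 0.5
    pairs = list(combinations(range(len(pop_means)), 2))
    ok_eq = n_eq = ok_ne = n_ne = 0
    for _ in range(sets):
        # Stage 1: one sample of size n from each population.
        samples = [[rng.gauss(mu, sd) for _ in range(n)] for mu in pop_means]
        xbar = [mean(s) for s in samples]
        # Pooled error variance (equal sample sizes, so a plain average).
        mse = mean([variance(s) for s in samples])
        lsd = t_crit * (2 * mse / n) ** 0.5
        # Stage 3: compare every difference between means with the LSD.
        for i, j in pairs:
            reject = abs(xbar[i] - xbar[j]) > lsd
            if pop_means[i] == pop_means[j]:
                n_eq += 1
                ok_eq += not reject
            else:
                n_ne += 1
                ok_ne += reject
    # Stage 2 is the repetition over `sets`; return the two proportions.
    return ok_eq / n_eq, ok_ne / n_ne


protection, power = simulate_multiple_t([5, 5, 5, 7, 7], sigma2=4.0)
print(f"correct when equality true:   {100 * protection:.0f}%")
print(f"correct when inequality true: {100 * power:.0f}%")
```

The default critical value is valid only for five populations with ten
observations each; other layouts need the t value for their own error degrees
of freedom.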
For the first situation in Table 5, a set of means 5, 5, 5, 7, 7 with
variance equal to 4 was used. The total number of decisions made was 4600,
since there are

$$\binom{5}{2} = \frac{5!}{2! \, 3!} = 10$$

decisions for each of 460 sets of samples.


[Figure 2: bracket diagram of the variety means under Tukey's test, Duncan's
NMRT, the S-N-K test, and the LSD and multiple-t tests.]

Fig. 2. The variety means within a bracket are asserted to be not
heterogeneous, and means not bracketed together are asserted to be
different.

Analysis of Data

Case N(5,5,6,6,7,7,8,8,9,9; 4)

$F_{.05,9,90}$ = 1.98; F = 6.70*

$R_p^*$ = 1.69 (lsd), 1.77, 1.83, 1.87, 1.90, 1.93, 1.95, 1.98, 1.99

Ordered Array of Means

[The ordered array of sample means and the table of pairwise differences
are illegible in the source scan.]

DECISIONS

[The table of decisions on each hypothesis $\mu_i = \mu_j$ under the LSD,
NMRT, and MT tests is illegible in the source scan.]
No. correct equalities: [illegible]    No. correct inequalities: 21-18-21

Table 4. Summary of protection levels for indicated sampling situations,
as obtained by Monte Carlo studies. All samples were of size
n=10 from each population.

                                        Number of    % Correct Decisions
                                        Decisions    When Equality True
Experimental Situation                               FLSD    Mt    NMRT

N(5,5,5,7,7; 4)                           1840        97     97     98
N(5,5,5.5,5.5,6,6,6.5,6.5,7,7; 4)         1500        96     96     97
N(5,5,5,5,5,5,7,7,7,7; 4)                 7035        95     95     98
N(5,5,6,6,7,7,8,8,9,9; 4)                 1500        94     94     96
N(5,5,5,5,5.5,5.5,5.5,5.5,6,6,6,6,
  6.5,6.5,6.5,6.5,7,7,7,7; 4)             1500        94     94     98
N(5,5,7,7,9,9,11,11,13,13; 4)             1500        96     96     96
N(5,5,5,7,7; 16)                          1128        98     92     92
N(5,5,5.5,5.5,6,6,6.5,6.5,7,7; 16)        1370        98     95     97
N(5,5,5,5,5,5,7,7,7,7; 16)                2100        97     94     94
N(5,5,6,6,7,7,8,8,9,9; 16)                 500        93     92     95

Table 5. Summary of powers for indicated sampling situations, as obtained
by Monte Carlo studies. All samples were of size n=10 from each
population.

                                        Number of    % Correct Decisions
                                        Decisions    If Inequality True
Experimental Situation                               FLSD    Mt    NMRT

N(5,5,5,7,7; 4)                           2760        51     55     53
N(5,5,5.5,5.5,6,6,6.5,6.5,7,7; 4)        12000        18     22     16
N(5,5,5,5,5,5,7,7,7,7; 4)                 8040        --     58     49
N(5,5,6,6,7,7,8,8,9,9; 4)                12000        54     54     49
N(5,5,5,5,5.5,5.5,5.5,5.5,6,6,6,6,
  6.5,6.5,6.5,6.5,7,7,7,7; 4)             8000        25     25     15
N(5,5.5,6,6.5,7,7.5,8,8.5, ...)           8910        45     48     44
N(5,6,7,8,9; 4)                           7030        52     52     51
N(5,7,9,11,13; 4)                         3960        83     83     83
N(5,5,7,7,9,9,11,11,13,13; 4)            12000        83     83     81
N(5,5,5,7,7; 16)                          1692        14     19     17
N(5,5,5.5,5.5,6,6,6.5,6.5,7,7; 16)       10960         3      9      5
N(5,5,5,5,5,5,7,7,7,7; 16)                2400        11     21     13
N(5,5,6,6,7,7,8,8,9,9; 16)                4000        20     24     18
N(5,6,7,8,9; 16)                          4510        16     22     18
N(5,7,9,11,13; 16)                        3850        --     --     --

-- : value illegible in the source scan.

Within this set of 10 comparisons there are 4 equalities and 6 inequalities
so that total equalities = 4 x 460 = 1840 and total inequalities = 6 x 460 =
2760.
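
The bookkeeping behind these totals can be checked with a small helper that
counts, for a given set of population means, how many of the pairwise
comparisons are true equalities and how many are true inequalities. The
function name is hypothetical; the inputs are the means and the number of sets
quoted in the text.

```python
from itertools import combinations


def count_decisions(pop_means, sets):
    """Return (total, true equalities, true inequalities) over all sets."""
    pairs = list(combinations(pop_means, 2))
    equal = sum(1 for a, b in pairs if a == b)
    unequal = len(pairs) - equal
    return len(pairs) * sets, equal * sets, unequal * sets


total, equalities, inequalities = count_decisions([5, 5, 5, 7, 7], sets=460)
print(total, equalities, inequalities)  # 4600 1840 2760
```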

The protection levels (Table 4) and powers (Table 5) are computed in
the following way, as for example was done for N(5, 5, 7, 7, 9, 9, 11,
11, 13, 13; 4). A total of 300 sets were used, hence

total number of decisions = 45 x 300 = 13500
total number of true equalities = 1500
total number of true inequalities = 12000

The following are the numbers of incorrect decisions among the 1500
decisions possible on equalities:

              LSD               MT                NMRT
Total No.     64                62                64
Percentage    64/1500 = 4.27    62/1500 = 4.13    64/1500 = 4.27

Therefore, the per cent of correct decisions for the situation is 96 per cent,
to the nearest whole per cent.

The numbers of correct decisions for inequalities are

                          LSD      MT       NMRT
Total No.                 9943     9765     9943
% = Total No./12000       82.86    81.37    82.86
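
The percentages in this example follow directly from the counts. The sketch
below recomputes them, using the 1500 true equalities as the base for the
protection level and the 12000 true inequalities as the base for the power;
the latter is the denominator that reproduces the quoted figures of about 82.9
and 81.4 per cent. The variable names are illustrative only.

```python
def percent(count, total):
    """Return count as a percentage of total."""
    return 100.0 * count / total


errors_on_equalities = {"LSD": 64, "MT": 62, "NMRT": 64}
correct_on_inequalities = {"LSD": 9943, "MT": 9765, "NMRT": 9943}

protection = {test: 100.0 - percent(errors, 1500)
              for test, errors in errors_on_equalities.items()}
power = {test: percent(correct, 12000)
         for test, correct in correct_on_inequalities.items()}

for test in ("LSD", "MT", "NMRT"):
    print(f"{test}: protection {protection[test]:.1f}%, power {power[test]:.1f}%")
```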

Comments  on  the  Empirical  Results 

From Tables 4 and 5 the multiple-t test has power superiority, as
predicted, over the other two tests. Fisher's LSD test has a power advantage,
in most situations, over Duncan's NMRT. For the Type I error, the multiple-t
test gives a slightly higher percentage of errors than the other two tests,
which is not serious. For Fisher's LSD test and Duncan's NMRT, the
probabilities of committing Type I errors are variable from one situation to
another.

Empirically, therefore, the multiple-t test has better power than either
of the other tests, and is simpler and more convenient to apply. If one fears
that its Type I error rate is too high (which is not confirmed herein), one
can use Fisher's LSD and maintain $(1-\alpha)$ protection against Type I
errors on the n-mean decisions.

This report is an attempt to compare and contrast various multiple
comparison tests developed by statisticians and mathematicians since 1927.
Not only are the test procedures illustrated, but also a discussion of the
procedures, emphasizing their individual advantages and limitations, has
been attempted.

Whenever a statistical experiment is conducted which is intended to
compare treatment effects of various sorts, the experimenter hopes to
determine which treatments are equal and which are unequal, on the average,
with respect to the measurement taken. Essentially, the experiment would
decide whether or not the samples came from the same population. Usually the
population parameter of most interest is the mean. If the treatment means are
different from one another, it is of interest to know which means differ,
and what are the magnitudes of these differences. These questions can be
answered by some of the methods of multiple comparison.

A discussion of the concept of multiple decisions, protection level
against Type I error, and power of a test precedes the descriptions of the
various testing procedures.

The following test procedures are discussed: Fisher's LSD test, the
multiple-t test, the Student-Newman-Keuls' test, Duncan's new multiple range
test, Tukey's test based on allowances, Duncan's multiple comparison test,
and Scheffe's test. These test procedures were felt to be representative
in their method of attacking the problem of multiple comparisons. They
differ from each other primarily in the relative importance assumed for errors
of the first and second kinds. The underlying assumptions are usually
normality and homogeneity of variance.

Some results of some Monte Carlo studies of three multiple comparison
test procedures are reported. The three test procedures considered were:
Fisher's LSD test, the multiple-t test, and Duncan's new multiple range test.
These tests were compared for protection against Type I error, and with
respect to their powers against Type II error for a number of known sampling
situations in which differences among the population means were known to
exist. Most discussions in the literature seem to overemphasize avoidance
of Type I error, when, in fact, most experiments are conducted after an
attempt has been made to create real differences.
It was found that:

a) No test, even Fisher's LSD, had poor protection against the sort
of Type I error studied.

b) The powers of the three tests studied generally were in the order:

Multiple-t > Fisher's LSD > Duncan's NMRT.

The latter conclusion merely verifies Duncan's own statements, but
conclusion a) seems to be contrary to the fears usually expressed in the
literature.


